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The force field by Lenosky and coworkers is the latest force field for silicon which is one of 
the most studied materials. It has turned out to be highly accurate in a large range of test cases. 
The optimization and parallelization of this force field using OpenMp and FortanQO is described 
here. The optimized program allows us to handle a very large number of silicon atoms in large 
scale simulations. Since all the parallelization is hidden in a single subroutine that returns the total 
energies and forces, this subroutine can be called from within a serial program in an user friendly 
way. 



I. PROGRAM SURVEY 



(N 

o 
o 

(N 

Title of program: siliconiap 

l/^ . Computer hardware and operating system: Any shared memory computer running under Unix or Linux 
(N ■ Programming Language: Fortran90 with OpenMP compiler directives 

Memory requirements: roughly 150 words per atom 
^ _ No. of bits in a word: 64 
ly-^ . No. of processors used: tested on up to 4 processors 

— ' Has the code been vectorized or parallelized: Parallelized withe OpenMP 
■"sj" . No. of bytes in distributed program, including test data, etc: 50 000 
Distribution format: Compressed tar file 

Keywords: Silicon, Interatomic potential. Force field. Molecular dynamics 
' Nature of physical problem: condensed matter physics 

Method of solution: Interatomic potential 
' Restrictions on the complexity of the problem: None 
^ . Typical running time: 30 /isec per step and per atom on a Compaq DEC Alpha 
I ' Unusual features of the program: None 

o 
o 



II. INTRODUCTION 



Due to its technological importance, silicon is one of the most studied materials. For small system sizes ab- 
initio density functional calculations are the preferred approach. Unfortunately this kind of calculation becomes 
unfeasible for larger systems required to study problems such as interfaces or extended defects. For this type of 
■ ■ ■ ' calculations one resorts to force fields which are several orders of magnitude faster. Recent progress in the development 
of force fields has demonstrated that they can be a reliable tool for such studies. A highly accurate silicon force field has 
been developed by Lenosky and coworkers Q . Its transferability has been demonstrated by extensive tests containing 
both bulk and cluster systems ^ . Its accuracy is in part due to the fact that second nearest neighbor interactions are 
included. This makes it unfortunately somewhat slower than force fields containing only nearest neighbor interactions. 
In the following a highly optimized parallel implementation of this force field will be presented that allows large scale 
calculations with this force field. The parallelization is achieved by using OpenMP, an emerging industry standard 
for medium size shared memory parallel computers. 

Molecular dynamics calculations Q have also been parallelized on distributed memory supercomputers This 
approach is considerably more complex than the one presented here. Since few researches have access to massively 
parallel supercomputers and are willing to overcome the complexities of doing molecular dynamics on such machines, 
medium scale parallelization of molecular dynamics has an important place in practice. 
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III. CALLING THE SUBROUTINE 



User friendliness was one of the major design goals in the development of this routine. Using Fortran90 made it 
possible to hide all the complexities in an object oriented fashion from the user. The calling sequence is just 

call lenosky (nat , alat , rxyz , f xyz , ener , coord , ener_var , coord_var , count) 

On input the user has to specify the number of atoms, not, the vector alat containing the 3 lattice constant of the 
orthorhombic periodic volume and the atomic positions rxyz. The program then returns the total energy, ener, the 
forces, fxyz, the average coordination number, the variation of the energy per atom and of the coordination number 
as well as an counter that is increased in each call. In particular the user has not to supply any Verlet list. 

Since the calculation of the forces is typically much more expensive than the update of the atomic positions in 
molecular dynamics or geometry optimizations, we expect that the subroutine will be called in most cases from within 
a serial program. In case the user is on a shared memory machine the subroutine will then nevertheless be executed 
in parallel if the program is compiled with the appropriate OpenMP options. 

In addition the subroutine can of course also be used on a serial machine. In this case all the parallelization 
directives are considered by the compiler to be comments. 

IV. CALCULATION OF THE VERLET LIST 

The Verlet list gives all the atoms that are contained within the potential cutoff distance cut of any given atom. 
Typically the Verlet list consists of two integer arrays. The first array, called Ista in this work, points to the 
first/last neighbor position in the second array Isth that contains the numbering of the atoms that are neighbors. 
A straightforward implementation for a non-periodic system containing nat atoms is shown below. In this simple 
case the search through all atoms is sequential with respect to their numbering and it is redundant to give both the 
starting positions lsta{l,iat) and the ending position lsta(2,iat), since lsta(\,iat) = lsta{2,iat — 1) + 1. But in the 
more complicated linear scaling algorithm to be presented below, both will be needed. 

indc=0 
do 10 iat=l,nat 
c starting position 

lsta(l , iat)=indc+l 
do 20 jat=l,nat 
if (jat.ne.iat) then 

xrell= rxyz (1 , jat) -rxyzd , iat) 
xrel2= rxyz (2, jat) -rxyz (2, iat) 
xrel3= rxyz (3, jat) -rxyz (3, iat) 
rr2=xrell**2 + xrel2**2 + xrel3**2 
if ( rr2 .le. cut**2 ) then 
indc=indc+l 
c nearest neighbor numbers 

lstb(indc)=jat 
endif 
endif 
20 continue 
c ending position 

Ista (2 , iat)=indc 
10 continue 

This straightforward implementations has a quadratic scaling with respect to the numbers of atoms. Due to this 
scaling the calculation of the Verlet list starts to dominate the linear scaling calculation of the energies and forces 
for system sizes of more than 10 000 atoms. It is therefore good practice to calculate the Verlet list with a modified 
algorithm that has linear scaling |^,|| as well. 

To do this one first subdivides the system into boxes that have a side length that is equal to or larger than cut 
and then finds all the atoms that are contained in each box. The CPU time for this first step is less than 1 percent 
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of the entire Verlet list calculation. Hence this part was not parallelized. It could significantly affect the parallel 
performance according to Amdahls law only if more than 50 processors are used. The largest SMP machines at 
our disposal had however only 4 processors. 

To implement periodic boundary conditions all atoms within a distance cut of the boundary of the periodic volume 
are replicated on the opposite part as shown in Figure |l|. This part is equally well less than 1 percent of the CPU 
time for the Verlet list for a 8000 atom system. Being a surface term it becomes even smaller for larger systems. 
Consequently it wasn't parallelized either. 
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FIG. 1. Illustration of the construction of the cell structure necessary for a linear scaling calculation of the nearest neighbor 
list for a 2-dimensional case. The periodic volume is indicated by the dark background. The bright cells are replicated dark 
cells. 

After these two preparing steps one has to search only among all the atoms in the reference cell containing the 
atom for which one wants to find its neighbors as well as all the atoms in the cells neighboring this reference cell (26 
cells in 3 dimensions). This implies that starting and ending positions Ista for the atoms 1 to nat are not calculated 
in a sequential way, necessitating, as mentioned before, separate starting and ending positions in the array Ista. 
The corresponding parallel code is shown below. The indices 11, 12, 13 refer to the cells, icell{ll, 12, 13, 0) contains the 
number of atoms in cell 11, 12, 13 and icell{ll, 12, 13, *) their numbering. The array rel saves the relative positions and 
distances that will again be needed in the loop calculating the forces and energies. Each thread has its own starting 
position iam * myspace + 1 in the shared memory space Istb and these starting positions are uniformly distributed. 
This approach allows the different threads to work independently. The resulting speedup is much higher than the one 
that one would obtain by calculating in the parallel version an array Istb that is identical to the one from the serial 
version. If there are on the average more neighbors than expected (24 by default) the allocated space becomes too 
small. In this case the array Istb is deallocated and a new larger version is allocated. This check for sufficient memory 
requires some minimal amount of coordination among the processors and is implemented by a critical section. If a 
reallocation is necessary, a message is written into an file to alert the user of the inefficiency due to the need of a 
second calculation of the Verlet list. 

allocate (Ista (2 , nat) ) 
nnbrx=24 
2345 nnbrx=3*nnbrx/2 

allocate (Istb (nnbrx*nat) ,rel(5,nnbrx*nat) ) 

indlstx=0 

! $omp parallel & 

!$omp private(iat,cut2,icmi,ii,indlst, 11, 12, 13, myspace, npr) & 

!$omp shared (indlstx, nat, nn,nnbrx,ncx, 111, 112, 113, icell, Ista, Istb, lay, & 

!$omp rel,rxyz,cut,myspaceout) 



3 



npr=l 

! $ npr=omp_get_num_threads 

icLm=0 

!$ iciin=omp_get_thread._nuin() 
cut2=cut**2 

myspace=(nat*nnbrx) /npr 

if (iam.eq.O) myspaceout=myspace 
! Verlet list, relative positions 

indlst=0 
do 6000,13=0,113-1 
do 6000,12=0,112-1 
do 6000,11=0,111-1 
do 6600,ii=l,icell(0,ll,12,13) 

iat=icell(ii,ll,12,13) 

if ( ( (iat-1) *npr) /nat .eq. iam) then 

Istad ,iat)=iain*myspace+indlst+l 

call sublstiat(iat,nn,ncx,lll,112,113,ll,12,13,myspace, & 

rxyz , icell , lstb(i£uii*myspace+l) , lay ,rel(l , iam*myspace+l) , cut2, indlst) 
lsta(2 , iat) =iam*myspace+indlst 
endif 

6600 continue 
6000 continue 
!$omp critical 

indl St x=mcLx ( indl s tx , indl st ) 
!$omp end critical 
! $onip end parallel 

if (indlstx.gt .myspaceout) then 

writedO,*) count , 'NNBRX too small', nnbrx 

deallocate (Istb.rel) 

goto 2345 
endif 



subroutine sublstiat(iat,nn,ncx,lll,112,113,ll,12,13,inyspace, & 

rxyz , icell , Istb , lay , rel , cut2 , indlst) 
implicit real*8 (a-h,o-z) 

dimension rxyz(3,nn) ,lay (nn) , icell(0 :ncx, -1 : 111 ,-1 : 112,-1 : 113) , & 
lstb(0:myspace-l) ,rel(5,0:myspace-l) 

do 6363, k3=13-l, 13+1 

do 6363, k2=12-l, 12+1 
do 6363, kl=ll-l, 11+1 
do 6363, jj=l, icell (0,kl,k2,k3) 

jat=icell(j j ,kl ,k2,k3) 

if (jat.eq.iat) goto 6363 

xrel= rxyz(l,iat)-rxyz(l, jat) 

yrel= rxyz (2 , iat) -rxyz (2 , jat) 

zrel= rxyz (3, iat) -rxyz (3, jat) 

rr2=xrel**2 + yrel**2 + zrel**2 

if ( rr2 .le. cut2 ) then 
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iiidlst=miii ( indlst , myspace- 1) 
lstb(indlst)=lay (jat) 
tt=sqrt(rr2) 
tti=l .dO/tt 

rel (1 , indlst) =xrel*tti 
rel (2 , indlst) =yrel*tti 
rel (3 , indlst) =zrel*tti 
rel (4, indlst) =tt 
rel (5 , indlst) =tti 
indlst= indlst+1 
endif 

6363 continue 

return 
end 

V. CALCULATION OF THE ENERGIES AND FORCES 

The computationally most important part taking some 80 percent of the CPU time is the calculation of the energies 
and forces. The energy expression for the Lenosky force field is given by 

E^Y. •^('^^j) + E ( E ^('•'j) + E fin,)f{nk)9{cos{9,.k)) J (1) 

i,j i \ j j-k J 

All the functions (0, J7, p, /, g) in this energy expression are given by cubic splines. The subroutine for evaluating the 
cubic spline is listed below. The case that the argument is outside the cubic spline interval [tmin, tmax] is rare and 
unimportant for performance considerations. The important cubic spline case is characterized by many dependencies. 
In the case of such dependencies the latency of the functional unit pipeline comes into play and reduces the attainable 
speed 0. A latency of some 20 cycles comes from the first two statements ( tt=(x-tmin)*hi ; klo=tt) alone, requiring 
arithmetic operations and a floating point to integer conversion. For this reason the calculation of tt was taken out 
of the (most likely occurring) else block to overlap its evaluation with the evaluation of the if clauses. To further 
speed up the evaluation of the splines the structure of the energy expression |^ was exploited. In the computationally 
most important loop over j and k two splines {f{rik) and g{cos{9jik)) ) have to be evaluated. Inlining by hand the 
subroutine splint for both evaluations and calculating alternatingly one step of the first spline evaluation and one 
step of the second spline evaluation introduces two independent streams. This reduces the effect of latencies and 
boosts speed. Compilers are not able to do these complex type of optimizations. The best performance after these 
optimizations was obtained with low level compiler optimization flags (-03 -qarch=pwr3 -qtune-pwrS on IBM PowerS, 
-02 on the Compaq DEC Alpha, -02 -xW on Intel Pentium4 ) 

subroutine splint (ya , y2a , tmin , tmax ,hsixth , h2sixth , hi , n , x , y , yp) 
implicit real*8 (a-h,o-z) 
dimension y2a(0:n-l) ,ya(0:n-l) 

! interpolate if the argument is outside the cubic spline interval [tmin, tmax] 
tt= (x-tmin) *hi 
if (x. It. tmin) then 

yp=hi*(ya(l)-ya(0)) - & 

( y2a(l)+2.d0*y2a(0) )*hsixth 

y=ya(0) + (x-tmin) *yp 
else if (x.gt.tmax) then 

yp=hi*(ya(n-l)-ya(n-2)) + k 

( 2.d0*y2a(n-l)+y2a(n-2) )*hsixth 

y=ya(n-l) + (x-tmax)*yp 
! otherwise evaluate cubic spline 
else 
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klo=tt 

khi=klo+l 

ya_klo=ya (klo) 

y2a_klo=y2a(klo) 

b=tt-klo 

a=l.dO-b 

ya_khi=ya (khi ) 

y2a_khi=y2a(khi) 

b2=b*b 

y=a*ya_klo 

yp=ya_khi-ya_klo 

a2=a*a 

cof I=a2-l.d0 

cof2=b2-l.d0 

y=y+b*ya_khi 

yp=hi*yp 

cof3=3.d0*b2 

cof4=3.d0*a2 

cof l=a*cof 1 

cof 2=b*cof 2 

cof3=cof3-l.d0 

cof4=cof4-l.dO 

ytl=cof l*y2a_klo 

yt2=cof 2*y2a_khi 

yptl=cof3*y2a_khi 

ypt2=cof4*y2a_klo 

y=y + (ytl+yt2)*h2sixth 

yp=yp + ( yptl - ypt2 )*hsixth 
endif 
return 
end 

The final single processor performance for tlic entire subroutine is 460 Mflops on a Compaq DEC Alpha at 833 
MHz, 300 Mflops on a IBM PowerS at 350 MHz and 550 Mflops on a Pentium 4. In order to obtain a high parallel 
speedup in this central part of the subroutine the threads are completely decoupled. This was done by introducing 
private copies for each thread to accumulate the energies tener and forces txyz. The global energy and force are 
summed up in an additional loop at the end of the parallel region in a critical section. 

! $omp parallel & 

!$omp private(icim,npr,iat,iatl,iat2,lot,istop,tcoord,tcoord2, & 
!$omp tener, tener2, txyz, f 2ij ,f3ij ,f3ik,npjx,npjkx) & 

! $omp shared (nat , nnbrx , Ista, Istb , rel , ener , ener2 , f xyz , coord , coord2 , istopg) 
npr=l 

! $ npr=onip_get_num_ threads 

iam=0 

!$ iam=omp_get_thread_num() 

npjx=300 ; npjkx=3000 
istopg=0 

if (npr.ne.l) then 
! PARALLEL CASE 

! create temporary private scalars for reduction sum on energies eind 
! temporary private array for reduction sum on forces 
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!$omp critical 

allocate (txyz(3,nat) ,f 2ij (3,npjx) ,f3ij (S.npjkx) ,f3ik(3,npjkx)) 
!$omp end critical 

if (iam.eq.O) then 

ener=0 . dO 

ener2=0 . dO 

coord=0 . dO 

coord2=0.d0 

do 121 , iat=l ,nat 

fxyz(l,iat)=0.dO 

fxyz(2,iat)=0.d0 
121 fxyz(3,iat)=0.d0 
endif 

lot=nat/npr+ . 999999999999d0 
iatl=iain*lot+l 

iat2=min( (iam+l) *lot ,nat) 

call subfeniat(iatl,iat2,nat,lsta,lstb,rel,tener,tener2, & 

t coord, tcoord2,nnbrx,txyz,f2ij ,npjx,f3ij ,npjkx,f 3ik, istop) 

!$omp critical 

ener=ener+tener 
ener2=ener2+tener2 
coord=coord+t coord 
coord2=coord2+tcoord2 
istopg=istopg+istop 
do 8093,iat=l,nat 

fxyz(l,iat)=fxyz(l,iat)+txyz(l,iat) 

fxyz(2,iat)=fxyz(2,iat)+txyz(2,iat) 

f xyz(3 , iat)=f xyz(3, iat)+txyz(3, iat) 
8093 continue 
!$omp end critical 

deallocate (txyz , f 2i j , f 3i j , f 3ik) 

else 
! SERIAL CASE 
iatl=l 
iat2=nat 

allocate (f2ij (3,npjx) ,f3ij (3,npjkx) ,f3ik(3,npjkx)) 

call subf eniat(iatl,iat2,nat,lsta,lstb,rel,ener,ener2, & 

coord , coord2 , nnbrx , f xyz , f 2i j , np j x , f 3i j , np jkx , f 3ik , istop) 
deallocate(f2ij ,f3ij ,f3ik) 

endif 
!$omp end parallel 

if (istopg.gt.O) stop 'DIMENSION ERROR (see WARNING above)' 

ener_var=ener2/nat-(ener/nat) **2 

coord=coord/nat 

coord_var=coord2/nat-coord**2 

deallocate (rxyz , icell , lay , Ista, Istb, rel) 

end 



subroutine subf eniat ( iat 1 , iat2 , nat , Ista , Istb , rel , tener , tener2 , & 
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tcoord,tcoord2,nnbrx,txyz,f 2ij ,npjx,f3ij ,npjkx,f3ik,istop) 

implicit real*8 (a-h,o-z) 

dimension lsta(2,nat) ,lstb(nnbrx*nat) ,rel(5,nnbrx*nat) ,txyz(3,nat) 
dimension f2ij (S.npjx) ,f3ij (3,npjkx) ,f 3ik(3,npjkx) 

initialize data 

! create temporary private scalars for reduction sum on energies and 
tener=0 . dO 
tener2=0.d0 
tcoord=0.dO 
tcoord2=0.d0 
istop=0 

do 121 , iat=l ,nat 
txyz(l,iat)=0.dO 
txyz(2,iat)=0.d0 
121 txyz(3,iat)=0.d0 

! calculation of forces, energy 

do 1000,iat=iatl,iat2 

dens2=0 . dO 
dens3=0 . dO 
j cnt=0 

jkcnt=0 

coord_iat=0 . dO 
ener_iat=0.dO 

do 2000, jbr=lsta(l,iat) ,lsta(2,iat) 

jat=lstb(jbr) 

jcnt=j cnt+1 

if (jcnt .gt .npjx) then 

write(6,*) 'WARNING: enlarge npjx' 

istop=l 
endif 

fxij=rel(l, jbr) 
fyij=rel(2,jbr) 
fzij=rel(3, jbr) 
rij=rel(4, jbr) 
sij=rel(5, jbr) 

! coordination number calculated with soft cutoff between first and 
! second nearest neighbor 

if (rij .le.2.36d0) then 

coord_iat=coord_iat+l . dO 

else if (rij .ge.3.83d0) then 

else 

x= (rij -2 . 36d0) * ( 1 . dO/ (3 . 83d0-2 . 36d0) ) 
coord_iat=coord_iat+ (2*x+l . dO) * (x-1 . dO) **2 
endif 

! pairpotential term 

call splint(cof _phi,dof_phi,tmin_phi,tmax_phi, & 

hsixth_phi ,h2sixth_phi , hi_phi , 10 , rij , e_phi , ep_phi) 
ener_iat=ener_iat+(e_phi* . 5d0) 
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txyz ( 1 , iat ) =txyz ( 1 , iat ) -f xi j * (ep_phi* . 5d0) 
txyz (2 , iat ) =txyz (2 , iat ) -f yi j * (ep_phi* . 5d0) 
txyz (3 , iat ) =txyz (3 , iat) -f zi j * (ep_ph.i* . 5d0) 
txyz ( 1 , j at ) =txyz ( 1 , j at ) +f xi j * (ep_phi* . 5d0) 
txyz (2 , j at ) =txyz (2 , j at ) +f yi j * (ep_phi* . 5d0) 
txyz (3 , j at ) =txyz (3 , j at) +f zi j * (ep_ph.i* . 5d0) 

! 2 body embedding term 

call spliiit(cof_rho,dof_rho,tmin_rho,tmax_rho, & 

hsixth_rho , h2sixth_rho , hi_rho ,11, ri j , rho , rhop) 
dens2=deiis2+rho 
f 2ij (1 , jcnt)=fxij*rhop 
f2ij (2, jcnt)=fyij*rhop 
f2ij (3, jcnt)=fzij*rhop 

! 3 body embedding term 

call splint (cof_fff ,dof_fff ,tmin_fff ,tmax_fff , & 

hsixth_fff ,h2sixth_fff ,hi_fff ,10,rij ,f ij ,fijp) 

do 3000, kbr=lsta(l, iat) ,lsta(2, iat) 

kat=lstb(kbr) 

if (kat.lt.jat) then 

jkcnt=jkcnt+l 

if (jkcnt .gt .npjkx) then 

write (6,*) 'WARNING: enlarge npjkx' 

istop=l 
endif 

! begin optimized version 
rik=rel(4,kbr) 
if (rik.gt .tmax_f f f ) then 
fikp=0.dO ; fik=0.dO 
gjik=0.dO ; gjikp=0.dO ; sik=0.dO 
costheta=0.dO ; fxik=0.dO ; fyik=0.dO ; fzik=0.dO 
else if (rik. It . tmin_f f f ) then 
fxik=rel(l,kbr) 
fyik=rel(2,kbr) 
f zik=rel (3 ,kbr) 

costheta=f xi j *f xik+f yi j *f yik+f zi j *f zik 
sik=rel(5,kbr) 

fikp=hi_fff*(cof_fff (l)-cof_fff (0)) - & 

( dof_fff (l)+2.d0*dof_fff (0) )*hsixth_fff 
fik=cof_fff (0) + (rik-tmin_fff)*fikp 
tt_ggg= (costheta-tmin_ggg) *hi_ggg 
if (costheta.gt .tmax_ggg) then 

gjikp=hi_ggg*(cof_ggg(8-l)-cof_ggg(8-2)) + & 

( 2.d0*dof_ggg(8-l)+dof_ggg(8-2) )*hsixth_ggg 
gjik=cof _ggg(8-l) + (costheta-tmax_ggg) *gj ikp 

else 

klo_ggg=tt_ggg 

khi-ggg=klo_ggg+l 

cof _ggg_klo=cof _ggg (klo_ggg) 

dof_ggg_klo=dof_ggg(klo_ggg) 

b-ggg=tt_ggg-klo_ggg 

a-ggg=l • <iO-b_ggg 

cof -ggg-khi=cof _ggg (khi_ggg) 
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dof _ggg_khi=dof _ggg (khi_ggg) 
b2_ggg=b_ggg*b_ggg 
gj ik=a_ggg*cof _ggg_klo 
gj ikp=cof _ggg_khi-cof _ggg_klo 
a2_ggg=a_ggg*a_ggg 
cof l_ggg=a2_ggg-l . dO 
cof 2_ggg=b2_ggg-l . dO 
gj ik=gj ik+b_ggg*cof _ggg_khi 
gj ikp=hi_ggg*gj ikp 
cof3_ggg=3.d0*b2_ggg 
cof 4_ggg=3 . d0*a2_ggg 
cof l_ggg=a_ggg*cof l_ggg 
cof2_ggg=b_ggg*cof2_ggg 
cof 3_ggg=cof 3_ggg-l . dO 
cof 4_ggg=cof 4_ggg-l . dO 
ytl-ggg=cofl_ggg*dof_ggg_klo 
yt2_ggg=cof2_ggg*dof_ggg_khi 
yp1^l-ggg=cof3_ggg*dof_ggg_khi 
ypt2_ggg=cof4_ggg*dof_ggg_klo 
gjik=gjik + (ytl_ggg+yt2_ggg)*h2sixth_ggg 
gjikp=gjikp + ( yptl_ggg - ypt2_ggg )*hsixth_ggg 
endif 
else 

fxik=rel(l,kbr) 
tt_fff=rik-tmin_fff 
costheta=f xi j *f xik 
fyik=rel(2,kbr) 
tt_fff=tt_fff*hi_fff 
costheta=costheta+f yi j *f yik 
fzik=rel(3,kbr) 
klo_fff=tt_fff 
costheta=costheta+f zi j *f zik 
sik=rel (5 ,kbr) 

-ggg= ( CO sthet a-tmin_ggg) *hi _ggg 
if (costheta.gt .tmax_ggg) then 

gjikp=hi_ggg*(cof_ggg(8-l)-cof_ggg(8-2)) + & 

( 2.d0*dof_ggg(8-l)+dof_ggg(8-2) )*hsixth_ggg 
gjik=cof _ggg(8-l) + (costheta-tinax_ggg)*gjikp 
khi_fff=klo_fff+l 
cof _f f f _klo=cof _f f f (klo_f f f ) 
dof _f f f _klo=dof _f f f (klo_f f f ) 
b_fff=tt_fff-klo_fff 
a_fff=l.dO-b_fff 
cof _f f f _khi=cof _f f f (khi_f f f ) 
dof _f f f _khi=dof _f f f (khi_f f f ) 
b2_fff=b_fff*b_fff 
fik=a_fff*cof_fff_klo 
f ikp=cof_fff_khi-cof_fff_klo 
a2_fff=a_fff*a_fff 
cof I_fff=a2_fff-l.d0 
cof2_fff=b2_fff-l.d0 
fik=fik+b_fff*cof_fff_khi 
f ikp=hi_fff *f ikp 
cof 3_f f f =3 . d0*b2_f f f 
cof 4_f f f =3 . d0*a2_f f f 
cof l_fff=a_fff*cofl_fff 
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cof2_fff=b_fff*cof2_fff 
cof 3_f f f =cof 3_f f f-1 .dO 
cof 4_f f f =cof 4_f f f -1 . dO 
ytl_fff=cof l_fff*dof_fff_klo 
yt2_fff=cof2_fff*dof_fff_khi 
yptl_fff=cof3_fff*dof_fff_khi 
ypt2_fff=cof4_fff*dof_fff_klo 
fik=fik + (ytl_fff+yt2_fff)*h2sixth_fff 
fikp=fikp + ( yptl_fff - ypt2_fff ) *hsixth_f f f 
else 

klo_ggg=tt_ggg 

klii-ggg=klo_ggg+l 
khi_fff=klo_fff+l 

cof_ggg_klo=cof_ggg(klo_ggg) 
cof _f f f _klo=cof _f f f (klo_f f f ) 

dof _ggg_klo=dof _ggg (klo_ggg) 
dof _f f f _klo=dof _f f f (klo_f f f ) 

ti-ggg=tt_ggg-klo_ggg 
b_fff=tt_fff-klo_fff 

a_ggg=l.dO-b_ggg 
a_fff=l.dO-b_fff 

cof_ggg_khi=cof_ggg(khi_ggg) 
cof _f f f _klii=cof _f f f (khi_f f f ) 

dof_ggg_khi=dof_ggg(khi_ggg) 
dof _f f f _khi=dof _f f f (khi_f f f ) 

ti2_ggg=b_ggg*b_ggg 
b2_fff=b_fff*b_fff 

gj ik=a_ggg*cof _ggg_klo 
fik=a_fff*cof_fff_klo 

gj ikp=cof _ggg_khi-cof _ggg_klo 
fikp=cof_fff_khi-cof_fff_klo 

a2_ggg=a_ggg*a_ggg 
a2_fff=a_fff*a_fff 

cofl_ggg=a2_ggg-l.d0 
cof I_fff=a2_fff-l.d0 

cof 2_ggg=b2_ggg-l . dO 
cof 2_f f f =b2_f f f -1 . dO 

gj ik=gj ik+b_ggg*cof _ggg_khi 
f ik=f ik+b_fff*cof_fff_khi 

gj ikp=lii-ggg*gj ikp 
fikp=hi_fff*fikp 

cof 3_ggg=3 . d0*b2_ggg 
cof 3_f f f =3 . d0*b2_f f f 

cof 4_ggg=3 . d0*a2_ggg 
cof4_fff=3.d0*a2_fff 

cof l_ggg=a_ggg*cof l_ggg 
cof l_fff=a_fff*cof l_fff 

cof 2_ggg=b_ggg* cof 2_ggg 
cof2_fff=b_fff*cof2_fff 

cof 3_ggg=cof 3_ggg-l . dO 
cof 3_f f f =cof 3_f f f-1 .dO 

cof 4_ggg=cof 4_ggg-l . dO 
cof4_fff=cof4_fff-l .dO 

ytl-ggg=cof l_ggg*dof_ggg_klo 
ytl_fff=cofl_fff*dof_fff_klo 

2_ggg=cof 2_ggg*dof _ggg_khi 
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yt2_fff=cof2_fff*dof_fff_khi 

ypt 1 -ggg= c o f 3_ggg* do f _ggg_khi 
yptl_fff=cof3_fff*dof_fff_khi 

ypt2_ggg=cof4_ggg*dof_ggg_klo 
ypt2_fff=cof4_fff*dof _fff_klo 

gjil^=gjik + (ytl_ggg+yt2_ggg)*h2sixth_ggg 
fik=fik + (ytl_fff+yt2_fff)*h2sixth_fff 

gjikp=gjikp + ( yptl_ggg - ypt2_ggg )*hsixth_ggg 
fikp=fikp + ( yptl_fff - ypt2_fff ) *hsixth_f f f 
endif 
endif 

! end optimized version 

tt=f ij*f ik 
dens3=dens3+tt*gj ik 

tl=f ijp*f ik*gj ik 
t2=sij*(tt*gj ikp) 

f3ij (1, jkcnt)=fxij*tl + (fxik-fxij*costheta)*t2 

f3ij (2, jkcnt)=fyij*tl + (f yik-f yij*costheta)*t2 
f3ij (3, jkcnt)=fzij*tl + (f zik-f zij*costheta)*t2 

t3=fikp*f ij*gjik 
t4=sik*(tt*gj ikp) 

f3ik(l, jkcnt)=fxik*t3 + (fxij-fxik*costheta)*t4 
f3ik(2, jkcnt)=fyik*t3 + (fyij-f yik*costheta)*t4 
f3ik(3, jkcnt)=fzik*t3 + (f zij-f zik*costheta)*t4 
endif 

3000 continue 
2000 continue 

dens=dens2+dens3 

call splint (cof_uuu,dof_uuu,tmin_uuu,tmax_uuu, & 

hsixth._uuu , h.2sixth_uuu , h.i_uuu , 8 , dens , e_uuu , ep_uuu) 
ener_iat=ener_iat+e_uuu 

! Only now ep_uu is known and the forces can be calculated, lets loop again 
j cnt=0 
jkcnt=0 

do 2200, jbr=lsta(l,iat) ,lsta(2,iat) 
jat=lstb(jbr) 

j cnt=j cnt+1 

txyz ( 1 , iat ) =txyz ( 1 , iat ) -ep_uuu*f 2i j ( 1 , j cnt) 
txyz(2, iat)=txyz(2, iat)-ep_uuu*f 2ij (2, jcnt) 
txyz (3 , iat ) =txyz (3 , iat ) -ep_uuu*f 2i j (3 , j cnt) 
txyzd , jat)=txyz(l , jat)+ep_uuu*f 2ij (1, jcnt) 
txyz(2, jat)=txyz(2, jat)+ep_uuu*f 2ij (2, jcnt) 
txyz (3 , j at ) =txyz (3 , j at ) +ep_uuu*f 2i j (3 , j cnt) 

! 3 body embedding term 

do 3300, kbr=lsta(l, iat) ,lsta(2, iat) 

kat=lstb(kbr) 

if (kat.lt.jat) then 

jkcnt=jkcnt+l 

txyzd , iat)=txyz(l , iat) -ep_uuu* (f 3ij (1 , jkcnt)+f 3ik(l , jkcnt)) 
txyz(2,iat)=txyz(2,iat)-ep_uuu*(f3ij (2, jkcnt)+f3ik(2, jkcnt)) 
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txyz (3 , iat ) =txyz (3 , iat ) -ep_uuu* (f 3i j (3 , jkcnt) +f 3ik (3 , jkcnt ) ) 

txyz ( 1 , j at ) =txyz ( 1 , j at ) +ep_uuu*f 3i j ( 1 , j kcnt ) 

txyz (2 , j at ) =txyz (2 , j at ) +ep_uuu*f 3i j (2 , jkcnt ) 

txyz (3 , j at ) =txyz (3 , j at ) +ep_uuu*f 3i j (3 , jkcnt ) 

txyz(l ,kat)=txyz(l ,kat)+ep_uuu*f 3ik(l , jkcnt) 

txyz(2,kat)=txyz(2, kat ) + ep_uuu* f 3ik(2,jkcnt) 

txyz (3 , kat ) =txyz (3 , kat ) +ep_uuu*f 3ik (3 , jkcnt ) 

endif 

3300 continue 
2200 continue 

tener=tener+ener_iat 
tener2=tener2+ener_iat**2 
tcoord=tcoord+coord_iat 
tcoord2=tcoord2+coord_iat**2 

1000 continue 

return 
end 

In addition to the energy and the forces the program still returns the coordination number as well as the variance of 
the energy per atom and the coordination number. The coordination number is calculated using a soft cutoff between 
the first and second nearest neighbor distance. These extra calculations are very cheap and not visible as an increase 
in the CPU time 

VI. PARALLEL PERFORMANCE RESULTS 

Table | shows the final overall speedups obtained by the program. The results were obtained for an 8000 atom 
system, but the CPU time per call and atom is nearly independent of system size. 



TABLE I. Timings in ^sec for a combined evaluation of the forces and the energy per particle as well as the corresponding 
speedups (in parentheses) on an IBM SP3 based on a 375 MHz PowerS processor, on a Compaq SC 232 based on a 833Mhz 
EV67 processor and on an Intel Pentium4 biprocessor at 2 GHz 



number of processors 


IBM Power3 


Compaq EV67 


Intel P4 


1 


46 


30 


25 


2 


24 (1.9) 


16 (1.9) 


13 (1.9) 


4 


13 (3.5) 


8.6 (3.5) 




8 


7.7 (6.0) 
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Obtaining such high speedups was not straightforward. Only the Compaq Fortran90 compiler was able to use in the 
original version of the program the OpenMP 'parallel do' directive to obtain a good speedup. Both the IBM compiler 
and the Intel compiler failed. In order to get the performances of Table ^ it was necessary to encapsulate the workload 
of the different threads into the subroutines 'sublstias' and 'subfen', which amounts to doing the parallelization quasi 
by hand. Using allocatable arrays in connection with OpenMP turned also out to be tricky. Because of these problems, 
the parallelization was much more painful that one might expect for a shared memory model. 



VII. CONCLUSIONS 



The results show that simulations for very large silicon systems are feasible on relatively cheap serial or parallel 
computers accessible to a large number of researches. 

I thank Tom Lenosky for his help in the implementation of the interatomic potential and Cyrus Umrigar for 
providing me with the timings on the IBM machine. 
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